33 research outputs found
Nowcasting user behaviour with social media and smart devices on a longitudinal basis: from macro- to micro-level modelling
The adoption of social media and smart devices by millions of users worldwide over the last decade has resulted in an unprecedented opportunity for NLP and social sciences. Users publish their thoughts and opinions on everyday issues through social media platforms, while they record their digital traces through their smart devices. Mining these rich resources offers new opportunities for sensing real-world events and indices (e.g., political preference, mental health indices) in a longitudinal fashion, either at the macro (population)- or the micro (user)-level.
The current project aims to develop approaches to “nowcast” (predict the current state of) such indices at both levels of granularity. First, we build natural language resources for the static tasks of sentiment analysis, emotion disclosure and sarcasm detection over user-generated content. These are important for opinion monitoring on a large scale. Second, we propose a general approach that leverages textual data derived from generic social media streams to nowcast political indices at the macro-level. Third, we leverage temporally sensitive and asynchronous information to nowcast the political stance of social media users at the micro-level, using multiple kernel learning. We then focus further on micro-level modelling to account for heterogeneous data sources, such as information derived from users' smartphones, SMS and social media messages, to nowcast time-varying mental health indices of a small cohort of users on a longitudinal basis. Finally, we present the challenges faced when applying such micro-level approaches in a real-world setting and propose directions for future research.
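The micro-level stance nowcasting above combines heterogeneous sources via multiple kernel learning. As a rough illustration (not the project's actual model), one can build one kernel per data source and feed a convex combination of them to a kernel classifier; full MKL would learn the combination weights rather than fixing them. All data, dimensions and weights below are toy assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy stand-ins for two heterogeneous views of the same users
# (e.g. text features and temporal activity features); hypothetical data.
rng = np.random.default_rng(0)
X_text = rng.normal(size=(40, 10))
X_time = rng.normal(size=(40, 5))
y = (X_text[:, 0] + X_time[:, 0] > 0).astype(int)  # synthetic labels

# One kernel per source; the fixed convex combination stands in for the
# learned kernel mixture of multiple kernel learning.
w_text, w_time = 0.6, 0.4
K = w_text * rbf_kernel(X_text) + w_time * rbf_kernel(X_time)

clf = SVC(kernel="precomputed").fit(K, y)
print(clf.score(K, y))  # training accuracy on the toy data
```

In a real MKL setup the weights `w_text` and `w_time` would be optimised jointly with the classifier, and test-time kernels would be computed between test and training points.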
Towards Real-Time, Country-Level Location Classification of Worldwide Tweets
In contrast to much previous work that has focused on location classification
of tweets restricted to a specific country, here we undertake the task in a
broader context by classifying global tweets at the country level, which is so
far unexplored in a real-time scenario. We analyse the extent to which a
tweet's country of origin can be determined by making use of eight
tweet-inherent features for classification. Furthermore, we use two datasets,
collected a year apart from each other, to analyse the extent to which a model
trained from historical tweets can still be leveraged for classification of new
tweets. With classification experiments on all 217 countries in our datasets,
as well as on the top 25 countries, we offer some insights into the best use of
tweet-inherent features for an accurate country-level classification of tweets.
We find that the use of a single feature, such as the use of tweet content
alone -- the most widely used feature in previous work -- leaves much to be
desired. Choosing an appropriate combination of both tweet content and metadata
can actually lead to substantial improvements of between 20% and 50%. We
observe that tweet content, the user's self-reported location and the user's
real name, all of which are inherent in a tweet and available in a real-time
scenario, are particularly useful to determine the country of origin. We also
experiment on the applicability of a model trained on historical tweets to
classify new tweets, finding that the choice of a particular combination of
features whose utility does not fade over time can actually lead to comparable
performance, avoiding the need to retrain. However, the difficulty of achieving
accurate classification increases slightly for countries with multiple
commonalities, especially for English- and Spanish-speaking countries.
Comment: Accepted for publication in IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE).
Predicting elections for multiple countries using Twitter and polls
The authors' work focuses on predicting the 2014 European Union elections in three different countries using Twitter and polls. Past works in this domain that relied strictly on Twitter data have proven ineffective. Others, using polls as their ground truth, have raised questions regarding the contribution of Twitter data to this task. Here, the authors treat the task as a multivariate time-series forecasting problem, extracting Twitter- and poll-based features and training different predictive algorithms. They achieve better results than several past works and a commercial baseline.
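The multivariate time-series framing can be sketched as follows: lagged poll and Twitter features predict the next poll value. The series, lag depth and model below are toy assumptions, not the paper's actual features or algorithms.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic weekly series for one party: a poll share and a noisy
# Twitter-derived signal (both hypothetical numbers).
rng = np.random.default_rng(1)
weeks = 30
polls = 30 + np.cumsum(rng.normal(0, 0.5, weeks))   # poll-based feature
twitter = polls + rng.normal(0, 1.0, weeks)         # Twitter-based feature

# Forecast next week's poll share from lagged poll and Twitter values.
lag = 2
X = np.column_stack([polls[i:weeks - lag + i] for i in range(lag)] +
                    [twitter[i:weeks - lag + i] for i in range(lag)])
y = polls[lag:]

model = LinearRegression().fit(X[:-1], y[:-1])  # hold out the last step
pred = model.predict(X[-1:])[0]
print(f"forecast {pred:.2f} vs actual {y[-1]:.2f}")
```

Real setups would use several parties per country, richer Twitter features (volume, sentiment), and forecasting models evaluated against published polls.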
Unsupervised Opinion Summarisation in the Wasserstein Space
Opinion summarisation synthesises opinions expressed in a group of documents
discussing the same topic to produce a single summary. Recent work has looked
at opinion summarisation of clusters of social media posts. Such posts are
noisy and have unpredictable structure, posing additional challenges for the
construction of the summary distribution and the preservation of meaning
compared to online reviews, which have so far been the focus of opinion
summarisation. To address these challenges we present WassOS, an
unsupervised abstractive summarisation model which makes use of the Wasserstein
distance. A Variational Autoencoder is used to get the distribution of
documents/posts, and the distributions are disentangled into separate semantic
and syntactic spaces. The summary distribution is obtained using the
Wasserstein barycenter of the semantic and syntactic distributions. A latent
variable sampled from the summary distribution is fed into a GRU decoder with a
transformer layer to produce the final summary. Our experiments on multiple
datasets including Twitter clusters, Reddit threads, and reviews show that
WassOS almost always outperforms the state-of-the-art on ROUGE metrics and
consistently produces the best summaries with respect to meaning preservation
according to human evaluations.
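The Wasserstein-barycenter step has a convenient closed form for Gaussian distributions, which is what makes it attractive for VAE posteriors: for one-dimensional Gaussians, the 2-Wasserstein barycenter is again Gaussian, with the weighted mean of the means and of the standard deviations. The toy posteriors below are illustrative, not WassOS's actual latent distributions.

```python
import numpy as np

# Hypothetical (mu, sigma) posteriors for three posts in a cluster.
posteriors = [(0.0, 1.0), (2.0, 0.5), (1.0, 1.5)]
weights = np.full(len(posteriors), 1 / len(posteriors))

mus = np.array([m for m, _ in posteriors])
sigmas = np.array([s for _, s in posteriors])

# Closed-form 1-D Gaussian W2 barycenter: weighted means of mu and sigma.
bary_mu = float(weights @ mus)
bary_sigma = float(weights @ sigmas)
print(bary_mu, bary_sigma)  # 1.0 1.0

# A summary latent would then be sampled from N(bary_mu, bary_sigma**2)
# and decoded (in WassOS, by a GRU decoder with a transformer layer).
z = np.random.default_rng(0).normal(bary_mu, bary_sigma)
```

WassOS computes barycenters over disentangled semantic and syntactic latent spaces; this sketch only shows the one-dimensional arithmetic behind the barycenter itself.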
Template-based Abstractive Microblog Opinion Summarisation
We introduce the task of microblog opinion summarisation (MOS) and share a
dataset of 3100 gold-standard opinion summaries to facilitate research in this
domain. The dataset contains summaries of tweets spanning a 2-year period and
covers more topics than any other public Twitter summarisation dataset.
Summaries are abstractive in nature and have been created by journalists
skilled in summarising news articles following a template separating factual
information (main story) from author opinions. Our method differs from previous
work on generating gold-standard summaries from social media, which usually
involves selecting representative posts and thus favours extractive
summarisation models. To showcase the dataset's utility and challenges, we
benchmark a range of abstractive and extractive state-of-the-art summarisation
models and achieve good performance, with the former outperforming the latter.
We also show that fine-tuning is necessary to improve performance and
investigate the benefits of using different sample sizes.
Comment: Accepted for publication in Transactions of the Association for
Computational Linguistics (TACL), 2022. Pre-MIT Press publication version.
Building and evaluating resources for sentiment analysis in the Greek language
Sentiment lexicons and word embeddings constitute well-established sources of information for sentiment analysis in online social media. Although their effectiveness has been demonstrated in state-of-the-art sentiment analysis and related tasks in the English language, such publicly available resources are much less developed and evaluated for the Greek language. In this paper, we tackle the problems arising when analyzing text in such an under-resourced language. We present and make publicly available a rich set of such resources, ranging from a manually annotated lexicon, to semi-supervised word embedding vectors and annotated datasets for different tasks. Our experiments using different algorithms and parameters on our resources show promising results over standard baselines; on average, we achieve a 24.9% relative improvement in F-score on the cross-domain sentiment analysis task when training the same algorithms with our resources, compared to training them on more traditional feature sources, such as n-grams. Importantly, while our resources were built with the primary focus on the cross-domain sentiment analysis task, they also show promising results in related tasks, such as emotion analysis and sarcasm detection.
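A manually annotated polarity lexicon is typically turned into classifier features by looking up each token and aggregating the scores. The tiny lexicon below contains made-up toy entries, not the paper's actual Greek lexicon.

```python
# Toy polarity lexicon: word -> score in [-1, 1] (hypothetical entries).
lexicon = {"καλός": 1.0, "ωραίος": 0.8, "κακός": -1.0, "απαίσιος": -0.9}

def lexicon_features(tokens):
    """Aggregate lexicon scores over a tokenised text into simple features."""
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return {"sum": 0.0, "mean": 0.0, "pos": 0, "neg": 0}
    return {
        "sum": sum(scores),
        "mean": sum(scores) / len(scores),
        "pos": sum(s > 0 for s in scores),
        "neg": sum(s < 0 for s in scores),
    }

print(lexicon_features(["πολύ", "καλός", "αλλά", "κακός"]))
# {'sum': 0.0, 'mean': 0.0, 'pos': 1, 'neg': 1}
```

Such features are usually concatenated with embedding-based representations before training the sentiment classifier, which is one way lexicon and embedding resources complement each other.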
Mining the UK web archive for semantic change detection
Semantic change detection (i.e., identifying words whose meaning has changed over time) started emerging as a growing area of research over the past decade, with important downstream applications in natural language processing, historical linguistics and computational social science. However, several obstacles make progress in the domain slow and difficult. These pertain primarily to the lack of well-established gold standard datasets, resources to study the problem at a fine-grained temporal resolution, and quantitative evaluation approaches. In this work, we aim to mitigate these issues by (a) releasing a new labelled dataset of more than 47K word vectors trained on the UK Web Archive over a short time-frame (2000-2013); (b) proposing a variant of Procrustes alignment to detect words that have undergone semantic shift; and (c) introducing a rank-based approach for evaluation purposes. Through extensive numerical experiments and validation, we illustrate the effectiveness of our approach against competitive baselines. Finally, we also make our resources publicly available to further enable research in the domain.
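The standard Procrustes recipe for semantic change detection (of which the paper proposes a variant) aligns word vectors from two periods with an orthogonal map and then ranks words by their post-alignment distance. The vectors below are synthetic toy data, not the UK Web Archive embeddings.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
d, n = 8, 10
V_2000 = rng.normal(size=(n, d))            # period-1 word vectors
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
V_2013 = V_2000 @ Q                         # same meanings, rotated space
V_2013[0] = rng.normal(size=d)              # word 0: simulated meaning change

# Orthogonal map minimising ||V_2000 @ R - V_2013||_F.
R, _ = orthogonal_procrustes(V_2000, V_2013)
aligned = V_2000 @ R

def cos_dist(a, b):
    return 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Rank words by post-alignment cosine distance; word 0, whose vector
# was replaced, is expected to show the largest drift.
drift = [cos_dist(aligned[i], V_2013[i]) for i in range(n)]
print(drift)
```

A rank-based evaluation, as in the paper, then compares this drift ranking against labelled shifted/stable words rather than thresholding the raw distances.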